-
Notifications
You must be signed in to change notification settings - Fork 763
migration for inactive user purge #6676
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
migration for inactive user purge #6676
Conversation
utils_module = importlib.import_module('kitsune.users.utils') | ||
delete_user_pipeline = utils_module.delete_user_pipeline |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see the need for this. I think it's safe to do from kitsune.users.utils import delete_user_pipeline
outside of this function.
last_id = 0 | ||
while True: | ||
batch = list( | ||
query.filter(id__gt=last_id) | ||
.order_by('id') | ||
.annotate(has_content=has_content_criteria) | ||
[:batch_size] | ||
) | ||
if not batch: | ||
break | ||
last_id = max(user.id for user in batch) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you could replace all of this and gain efficiency (fewer queries) by using the iterator
method that Django provides (as we do in 0034_batch_delete_non_migrated_users.py). So something like:
current_batch = []
for user in query.annotate(has_content=has_content_criteria).iterator(chunk_size=batch_size):
current_batch.append(user)
if len(current_batch) >= batch_size:
# Do the work
...
current_batch = []
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I talked myself out of iterator when I was sorting into two groups; but when I looked at your note below, I am back to being in camp iterator().
users_with_content = [user for user in batch if user.has_content] | ||
users_no_content = [user for user in batch if not user.has_content] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here you're iterating over 1k users twice, and then later iterating over the users_no_content
to get their id
's. You could do all of that in one iteration, something like:
for user in current_batch:
if user.has_content:
# run the pipeline
else:
users_no_content.append(user.id)
else:
User._base_manager.filter(id__in=users_no_content).delete()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a great catch. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
r+wc
# Progress reporting every 1000 users | ||
if processed_count % 1000 == 0: | ||
elapsed_time = time.time() - start_time | ||
progress_pct = (processed_count/total_users*100) if total_users > 0 else 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit. Since you've already checked if total_users
is zero above (and returned in that case), you don't need to check again here, so I think you can remove if total_users > 0 else 0
avg_time = elapsed_time / processed_count if processed_count > 0 else 0 | ||
remaining_time = (total_users - processed_count) * avg_time if processed_count > 0 else 0 | ||
current_rate = processed_count / elapsed_time * 60 if elapsed_time > 0 else 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit. Same here. I think for all of these lines, the processed_count
and elapsed_time
values are guaranteed to be greater than zero, so I don't think you need the if x > 0 else 0
checks.
@smithellis Forgot to mention one more thing. Just FYI, you can use |
|
||
cutoff_date = timezone.now() - timedelta(days=3*365) | ||
|
||
query = User.objects.filter(last_login__lt=cutoff_date).annotate(has_content=has_content_criteria) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are you using annotate
here instead of filter/exclude/Exists
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This use case absolutely works - I think it's just not specifically called out in the 4.2 docs but is in later docs. I can run this query and see the output and it builds valid sql which executes properly.
I'm annotating here so we can later divide our users into those with and those without content, so we can execute a quicker delete process on non-content users.
@@ -0,0 +1,131 @@ | |||
from datetime import timedelta | |||
import time | |||
import importlib |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this used anywhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, leftover from prior change. Weird pre-commit didn't fuss at me.
""" | ||
Delete users who haven't logged in for over three years. | ||
""" | ||
User = get_user_model() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to be User = apps.get_model("auth", "User")
in migrations similar to the previous one (0034)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is necessary because the delete_user_pipeline
function needs an actual User instance vs. a historical model.
1 - get all users who we consider inactive (3 years since login)
2 - divide into users having content and users without content
3 - hard deletes non-content users using the
_base_manager
4 - pushes other users through the deletion pipeline
5 - implements batching to manage resources